Abstract

The group Lasso is an extension of the Lasso for feature selection on
(predefined) non-overlapping groups of features. The non-overlapping group
structure limits its applicability in practice. There have been several
recent attempts to study a more general formulation, where groups of features
are given, potentially with overlaps between the groups. The resulting
optimization is, however, much more challenging to solve due to the group
overlaps. In this paper, we consider the efficient optimization of the
overlapping group Lasso penalized problem. We reveal several key properties
of the proximal operator associated with the overlapping group Lasso, and
compute the proximal operator by solving the smooth and convex dual problem,
which allows the use of gradient descent type algorithms for the
optimization. We have performed empirical evaluations using the breast cancer
gene expression data set, which consists of 8,141 genes organized into
(overlapping) gene sets. Experimental results demonstrate the efficiency and
effectiveness of the proposed algorithm.



1 Introduction 

Problems with high dimensionality have become common in recent years. The
high dimensionality poses significant challenges in building interpretable
models with high prediction accuracy. Regularization has been commonly
employed to obtain more stable and interpretable models. A well-known example
is the penalization of the $\ell_1$ norm of the estimator, known as Lasso
[22]. The $\ell_1$ norm regularization has achieved great success in many
applications. However, in some applications [27], we are interested in
finding important explanatory factors in predicting the response variable,
where each explanatory factor is represented by a group of input features. In
such cases, the selection of important features corresponds to the selection
of groups of features. As an extension of Lasso, group Lasso [27], based on
the combination of the $\ell_1$ norm and the $\ell_2$ norm, has been proposed
for group feature selection, and quite a few efficient algorithms
[HJ [T31 US] have been proposed for its optimization. However, the
non-overlapping group structure in group Lasso limits its applicability in
practice. For example, in microarray gene expression data analysis, genes may
form overlapping groups as each gene may participate in multiple pathways [5].

Several recent works [3J [5J [TTJ [25] study the overlapping group Lasso,
where groups of features are given, potentially with overlaps between the
groups. The resulting optimization is, however, much more challenging to
solve due to the group overlaps. When optimizing the overlapping group Lasso
problem, one can reformulate it as a second-order cone program and solve it
with generic toolboxes, which, however, do not scale well. In [9], an
alternating algorithm called SLasso is proposed for solving the equivalent
reformulation. However, SLasso involves an expensive matrix inversion at each
alternating iteration, and there is no known global convergence rate for such
an alternating procedure. It was recently shown in [10] that, for the tree
structured group Lasso, the associated proximal operator (or equivalently,
the Moreau-Yosida regularization) [TTJ [53] can be computed by applying block
coordinate ascent in the dual, and the algorithm converges in one pass. It
was shown independently in [14] that the proximal operator associated with
the tree structured group Lasso has a nice analytical solution. However, to
the best of our knowledge, there is no analytical solution to the proximal
operator associated with the general overlapping group Lasso.

In this paper, we develop an efficient algorithm for the overlapping group
Lasso penalized problem via the accelerated gradient descent method. The
accelerated gradient descent method has recently received increasing
attention in machine learning due to its fast convergence rate even for
nonsmooth convex problems. One of the key operations is the computation of
the proximal operator associated with the penalty. We reveal several key
properties of the proximal operator associated with the overlapping group
Lasso penalty, and compute the proximal operator by solving the dual problem.
The main contributions of this paper include: (1) we develop a procedure to
identify many zero groups in the proximal operator, which dramatically
reduces the size of the dual problem to be solved; (2) we show that the dual
problem is smooth and convex with Lipschitz continuous gradient, and thus can
be solved by existing smooth convex optimization tools; and (3) we derive the
duality gap between the primal and dual problems, which can be used to check
the quality of the solution and determine the convergence of the algorithm.
We have performed empirical evaluations using the breast cancer gene
expression data set, which consists of 8,141 genes organized into
(overlapping) gene sets. Experimental results demonstrate the efficiency and
effectiveness of the proposed algorithm.

Notations: $\|\cdot\|$ denotes the Euclidean norm, and $\mathbf{0}$ denotes a
vector of zeros. $\mathrm{SGN}(\cdot)$ and $\mathrm{sgn}(\cdot)$ are defined
in a componentwise fashion as: 1) if $t = 0$, then $\mathrm{SGN}(t) = [-1, 1]$
and $\mathrm{sgn}(t) = 0$; 2) if $t > 0$, then $\mathrm{SGN}(t) = \{1\}$ and
$\mathrm{sgn}(t) = 1$; and 3) if $t < 0$, then $\mathrm{SGN}(t) = \{-1\}$ and
$\mathrm{sgn}(t) = -1$. $G_i \subseteq \{1, 2, \ldots, p\}$ denotes an index
set, and $\mathbf{x}_{G_i}$ denotes a subvector of $\mathbf{x}$ consisting of
the entries indexed by $G_i$.

2 The Overlapping Group Lasso



We consider the following overlapping group Lasso penalized problem:

$$\min_{\mathbf{x} \in \mathbb{R}^p} f(\mathbf{x}) = l(\mathbf{x}) + \phi^{\lambda_1}_{\lambda_2}(\mathbf{x}) \qquad (1)$$

where $l(\cdot)$ is a smooth convex loss function, e.g., the least squares loss,

$$\phi^{\lambda_1}_{\lambda_2}(\mathbf{x}) = \lambda_1 \|\mathbf{x}\|_1 + \lambda_2 \sum_{i=1}^{g} w_i \|\mathbf{x}_{G_i}\| \qquad (2)$$

is the overlapping group Lasso penalty, $\lambda_1 \geq 0$ and
$\lambda_2 \geq 0$ are regularization parameters, $w_i > 0$,
$i = 1, 2, \ldots, g$, $G_i \subseteq \{1, 2, \ldots, p\}$ contains the
indices corresponding to the $i$-th group of features, and $\|\cdot\|$ denotes
the Euclidean norm. The $g$ groups of features are pre-specified, and they may
overlap. The penalty in (2) is a special case of the more general Composite
Absolute Penalty (CAP) family [28]. When the groups are disjoint with
$\lambda_1 = 0$ and $\lambda_2 > 0$, the model in (1) reduces to the group
Lasso [27]. If $\lambda_1 > 0$ and $\lambda_2 = 0$, then the model in (1)
reduces to the standard Lasso [22].
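As a minimal illustration of the penalty in (2), the following Python sketch evaluates it for a small vector with two overlapping groups. The function name, the representation of groups as lists of 0-based indices, and the example values are assumptions made here for illustration only.

```python
import math

def overlapping_group_lasso_penalty(x, groups, weights, lam1, lam2):
    """Evaluate lam1 * ||x||_1 + lam2 * sum_i w_i * ||x_{G_i}||, as in (2)."""
    l1_term = sum(abs(t) for t in x)
    group_term = 0.0
    for group, w in zip(groups, weights):
        # Euclidean norm of the subvector indexed by this group (0-based indices).
        group_term += w * math.sqrt(sum(x[j] ** 2 for j in group))
    return lam1 * l1_term + lam2 * group_term

# Two groups that share feature 1 (an overlap).
x = [3.0, 4.0, 0.0]
groups = [[0, 1], [1, 2]]
weights = [1.0, 1.0]
value = overlapping_group_lasso_penalty(x, groups, weights, lam1=1.0, lam2=2.0)
print(value)  # 1 * 7 + 2 * (5 + 4) = 25.0
```

Setting `lam2=0` leaves only the $\ell_1$ term, which corresponds to the standard Lasso special case mentioned above.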

In this paper, we propose to make use of the accelerated gradient descent
(AGD) [1, 18, 19] for solving (1), due to its fast convergence rate. The
algorithm is called "FoGLasso", which stands for Fast overlapping Group
Lasso. One of the key steps in the proposed FoGLasso algorithm is the
computation of the proximal operator associated with the penalty in (2), and
we present an efficient algorithm for the computation in the next section.

In FoGLasso, we first construct a model for approximating $f(\cdot)$ at the
point $\mathbf{x}$ as:

$$f_{L,\mathbf{x}}(\mathbf{y}) = [l(\mathbf{x}) + \langle l'(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle] + \phi^{\lambda_1}_{\lambda_2}(\mathbf{y}) + \frac{L}{2}\|\mathbf{y} - \mathbf{x}\|^2, \qquad (3)$$

where $L > 0$. The model $f_{L,\mathbf{x}}(\mathbf{y})$ consists of the
first-order Taylor expansion of the smooth function $l(\cdot)$ at the point
$\mathbf{x}$, the non-smooth penalty $\phi^{\lambda_1}_{\lambda_2}(\mathbf{y})$,
and a regularization term $\frac{L}{2}\|\mathbf{y} - \mathbf{x}\|^2$. Next, a
sequence of approximate solutions $\{\mathbf{x}_i\}$ is computed as follows:

$$\mathbf{x}_{i+1} = \arg\min_{\mathbf{y}} f_{L_i,\mathbf{s}_i}(\mathbf{y}), \qquad (4)$$

where the search point $\mathbf{s}_i$ is an affine combination of
$\mathbf{x}_{i-1}$ and $\mathbf{x}_i$ as

$$\mathbf{s}_i = \mathbf{x}_i + \beta_i(\mathbf{x}_i - \mathbf{x}_{i-1}), \qquad (5)$$

for a properly chosen coefficient $\beta_i$. $L_i$ is determined by the line
search according to the Armijo-Goldstein rule so that $L_i$ is appropriate
for $\mathbf{s}_i$, i.e.,
$f(\mathbf{x}_{i+1}) \leq f_{L_i,\mathbf{s}_i}(\mathbf{x}_{i+1})$. Following
the analysis in [H [TS], we can show that FoGLasso achieves a convergence
rate of $O(1/k^2)$ for $k$ iterations, which is optimal among first-order
methods. A key building block in FoGLasso is the minimization of (3), whose
solution is known as the proximal operator [3, 17]. The computation of the
proximal operator is the main technical contribution of this paper. The
pseudo-code of FoGLasso is summarized in Algorithm 1, where the proximal
operator $\pi(\cdot)$ is defined in (6). In practice, we can terminate
Algorithm 1 if the change of the function values corresponding to adjacent
iterations is within a small value, say $10^{-5}$.



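Algorithm 1 The FoGLasso Algorithm
Input: $L_0 > 0$, $\mathbf{x}_0$, $k$
Output: $\mathbf{x}_{k+1}$

1: Initialize $\mathbf{x}_1 = \mathbf{x}_0$, $\alpha_{-1} = 0$, $\alpha_0 = 1$, and $L = L_0$.
2: for $i = 1$ to $k$ do
3:   Set $\beta_i = \frac{\alpha_{i-2} - 1}{\alpha_{i-1}}$, $\mathbf{s}_i = \mathbf{x}_i + \beta_i(\mathbf{x}_i - \mathbf{x}_{i-1})$
4:   Find the smallest $L = 2^j L_{i-1}$, $j = 0, 1, \ldots$ such that
     $f(\mathbf{x}_{i+1}) \leq f_{L,\mathbf{s}_i}(\mathbf{x}_{i+1})$ holds, where
     $\mathbf{x}_{i+1} = \pi^{\lambda_1/L}_{\lambda_2/L}(\mathbf{s}_i - \frac{1}{L} l'(\mathbf{s}_i))$
5:   Set $L_i = L$ and $\alpha_{i+1} = \frac{1 + \sqrt{1 + 4\alpha_i^2}}{2}$
6: end for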
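The loop in Algorithm 1 can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the function names (`fog_lasso`, `least_squares`, `soft_threshold`), the callback signature `prox(v, L)`, and the momentum recursion (the standard accelerated gradient update) are assumptions made here. To keep the example self-contained, it is run only on the Lasso special case $\lambda_2 = 0$, where the proximal operator is plain soft-thresholding.

```python
import math

def soft_threshold(v, t):
    """Componentwise sgn(v) * max(|v| - t, 0)."""
    return [math.copysign(max(abs(a) - t, 0.0), a) for a in v]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def least_squares(A, b):
    """Return the loss l(x) = 0.5 * ||Ax - b||^2 and its gradient A^T (Ax - b)."""
    def loss(x):
        return 0.5 * sum((r - bi) ** 2 for r, bi in zip(matvec(A, x), b))
    def grad(x):
        res = [r - bi for r, bi in zip(matvec(A, x), b)]
        return [sum(A[i][j] * res[i] for i in range(len(A))) for j in range(len(x))]
    return loss, grad

def fog_lasso(loss, grad, prox, x0, L0=1.0, k=100):
    """Accelerated proximal gradient loop in the spirit of Algorithm 1.

    prox(v, L) is assumed to return the proximal operator of the penalty
    with both regularization parameters divided by L.
    """
    x_prev, x = list(x0), list(x0)    # x_{i-1}, x_i
    alpha_prev, alpha = 0.0, 1.0      # alpha_{i-2}, alpha_{i-1}
    L = L0
    for _ in range(k):
        beta = (alpha_prev - 1.0) / alpha
        s = [xi + beta * (xi - xp) for xi, xp in zip(x, x_prev)]
        g, loss_s = grad(s), loss(s)
        while True:  # line search: double L until the model upper-bounds the loss
            x_new = prox([si - gi / L for si, gi in zip(s, g)], L)
            d = [a - c for a, c in zip(x_new, s)]
            model = (loss_s + sum(gi * di for gi, di in zip(g, d))
                     + 0.5 * L * sum(di * di for di in d))
            if loss(x_new) <= model + 1e-12:
                break
            L *= 2.0
        x_prev, x = x, x_new
        alpha_prev, alpha = alpha, (1.0 + math.sqrt(1.0 + 4.0 * alpha * alpha)) / 2.0
    return x

# Lasso special case (lambda_2 = 0): the proximal operator is soft-thresholding.
A = [[2.0, 0.0], [0.0, 1.0]]
b = [4.0, 1.0]
lam1 = 1.0
loss, grad = least_squares(A, b)
x_hat = fog_lasso(loss, grad, lambda v, L: soft_threshold(v, lam1 / L), [0.0, 0.0])
print(x_hat)  # close to [1.75, 0.0]
```

Because the penalty appears on both sides of the line search condition, the check reduces to comparing the smooth loss with its quadratic upper model, which is what the inner `while` loop does.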



3 The Associated Proximal Operator and Its Efficient Computation

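The proximal operator associated with the overlapping group Lasso penalty is
defined as follows:

$$\pi^{\lambda_1}_{\lambda_2}(\mathbf{v}) = \arg\min_{\mathbf{x} \in \mathbb{R}^p} \left\{ g^{\lambda_1}_{\lambda_2}(\mathbf{x}) \equiv \frac{1}{2}\|\mathbf{x} - \mathbf{v}\|^2 + \lambda_1 \|\mathbf{x}\|_1 + \lambda_2 \sum_{i=1}^{g} w_i \|\mathbf{x}_{G_i}\| \right\}, \qquad (6)$$

which is a special case of (1) by setting
$l(\mathbf{x}) = \frac{1}{2}\|\mathbf{x} - \mathbf{v}\|^2$. It can be verified
that the approximate solution $\mathbf{x}_{i+1}$ in (4) is given by
$\mathbf{x}_{i+1} = \pi^{\lambda_1/L_i}_{\lambda_2/L_i}(\mathbf{s}_i - \frac{1}{L_i} l'(\mathbf{s}_i))$.
Recently, it has been shown in [10, 14, 15] that the efficient computation of
the proximal operator is key to many sparse learning algorithms
[15, Section 2]. Next, we focus on the efficient computation of
$\pi^{\lambda_1}_{\lambda_2}(\mathbf{v})$ in (6) for a given $\mathbf{v}$.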

3.1 Key Properties of the Proximal Operator 

We first reveal several basic properties of the proximal operator
$\pi^{\lambda_1}_{\lambda_2}(\mathbf{v})$.

Lemma 1. Suppose that $\lambda_1, \lambda_2 \geq 0$, and $w_i > 0$, for
$i = 1, 2, \ldots, g$. Let
$\mathbf{x}^* = \pi^{\lambda_1}_{\lambda_2}(\mathbf{v})$. The following holds:
1) if $v_i > 0$, then $0 \leq x_i^* \leq v_i$; 2) if $v_i < 0$, then
$v_i \leq x_i^* \leq 0$; 3) if $v_i = 0$, then $x_i^* = 0$; 4)
$\mathrm{SGN}(\mathbf{v}) \subseteq \mathrm{SGN}(\mathbf{x}^*)$; and 5)
$\pi^{\lambda_1}_{\lambda_2}(\mathbf{v}) = \mathrm{sgn}(\mathbf{v}) \odot \pi^{\lambda_1}_{\lambda_2}(|\mathbf{v}|)$.

Proof. When $\lambda_1, \lambda_2 \geq 0$, and $w_i > 0$, for
$i = 1, 2, \ldots, g$, the objective function
$g^{\lambda_1}_{\lambda_2}(\cdot)$ is strictly convex, thus $\mathbf{x}^*$ is
the unique minimizer. We first show that if $v_i > 0$, then
$0 \leq x_i^* \leq v_i$. If $x_i^* > v_i$, then we can construct a
$\hat{\mathbf{x}}$ as follows: $\hat{x}_j = x_j^*$, $j \neq i$, and
$\hat{x}_i = v_i$. Similarly, if $x_i^* < 0$, then we can construct a
$\hat{\mathbf{x}}$ as follows: $\hat{x}_j = x_j^*$, $j \neq i$, and
$\hat{x}_i = 0$. It is easy to verify that $\hat{\mathbf{x}}$ achieves a lower
objective function value than $\mathbf{x}^*$ in both cases. We can prove the
second and the third properties using similar arguments. Finally, we can
prove the fourth and the fifth properties using the definition of
$\mathrm{SGN}(\cdot)$ and the first three properties. □

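Next, we show that $\pi^{\lambda_1}_{\lambda_2}(\cdot)$ can be directly
derived from $\pi^{0}_{\lambda_2}(\cdot)$ by soft-thresholding. Thus, we only
need to focus on the simple case when $\lambda_1 = 0$. This simplifies the
optimization in (6).

Theorem 1. Let

$$\mathbf{u} = \mathrm{sgn}(\mathbf{v}) \odot \max(|\mathbf{v}| - \lambda_1, 0), \qquad (7)$$

$$\pi^{0}_{\lambda_2}(\mathbf{u}) = \arg\min_{\mathbf{x} \in \mathbb{R}^p} \left\{ h_{\lambda_2}(\mathbf{x}) \equiv \frac{1}{2}\|\mathbf{x} - \mathbf{u}\|^2 + \lambda_2 \sum_{i=1}^{g} w_i \|\mathbf{x}_{G_i}\| \right\}. \qquad (8)$$

The following holds:

$$\pi^{\lambda_1}_{\lambda_2}(\mathbf{v}) = \pi^{0}_{\lambda_2}(\mathbf{u}). \qquad (9)$$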



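Proof. Denote the unique minimizer of $h_{\lambda_2}(\cdot)$ as
$\mathbf{x}^*$. The sufficient and necessary condition for the optimality of
$\mathbf{x}^*$ is:

$$\mathbf{0} \in \partial h_{\lambda_2}(\mathbf{x}^*) = \mathbf{x}^* - \mathbf{u} + \partial \phi^{0}_{\lambda_2}(\mathbf{x}^*), \qquad (10)$$

where $\partial h_{\lambda_2}(\mathbf{x})$ and
$\partial \phi^{0}_{\lambda_2}(\mathbf{x})$ are the subdifferential sets of
$h_{\lambda_2}(\cdot)$ and $\phi^{0}_{\lambda_2}(\cdot)$ at $\mathbf{x}$,
respectively.

To prove (9), it suffices to show
$\mathbf{0} \in \partial g^{\lambda_1}_{\lambda_2}(\mathbf{x}^*)$. The
subdifferential of $g^{\lambda_1}_{\lambda_2}(\cdot)$ at $\mathbf{x}^*$ is
given by

$$\partial g^{\lambda_1}_{\lambda_2}(\mathbf{x}^*) = \mathbf{x}^* - \mathbf{v} + \partial \phi^{\lambda_1}_{\lambda_2}(\mathbf{x}^*) = \mathbf{x}^* - \mathbf{v} + \lambda_1 \mathrm{SGN}(\mathbf{x}^*) + \partial \phi^{0}_{\lambda_2}(\mathbf{x}^*). \qquad (11)$$

It follows from (7) that
$\mathbf{u} \in \mathbf{v} - \lambda_1 \mathrm{SGN}(\mathbf{u})$. Using the
fourth property in Lemma 1, we have
$\mathrm{SGN}(\mathbf{u}) \subseteq \mathrm{SGN}(\mathbf{x}^*)$. Thus,

$$\mathbf{u} \in \mathbf{v} - \lambda_1 \mathrm{SGN}(\mathbf{x}^*). \qquad (12)$$

It follows from (10)-(12) that
$\mathbf{0} \in \partial g^{\lambda_1}_{\lambda_2}(\mathbf{x}^*)$. □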
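The two-stage structure in Theorem 1 can be checked numerically on a case where the second stage has a known closed form. The sketch below assumes a single group covering all features, for which the minimizer of $\frac{1}{2}\|\mathbf{x} - \mathbf{u}\|^2 + \lambda_2 w \|\mathbf{x}\|$ is the standard group soft-thresholding formula; this special case, the function names, and the example values are assumptions chosen for illustration and do not cover the general overlapping case.

```python
import math

def soft_threshold(v, lam1):
    """u = sgn(v) * max(|v| - lam1, 0), as in (7)."""
    return [math.copysign(max(abs(a) - lam1, 0.0), a) for a in v]

def single_group_prox(u, lam2_w):
    """Closed-form minimizer of 0.5*||x - u||^2 + lam2_w*||x|| (one group only)."""
    norm = math.sqrt(sum(a * a for a in u))
    if norm <= lam2_w:
        return [0.0] * len(u)
    scale = 1.0 - lam2_w / norm
    return [scale * a for a in u]

def objective(x, v, lam1, lam2_w):
    """g(x) = 0.5*||x - v||^2 + lam1*||x||_1 + lam2_w*||x|| for a single group."""
    quad = 0.5 * sum((a - b) ** 2 for a, b in zip(x, v))
    l1 = sum(abs(a) for a in x)
    l2 = math.sqrt(sum(a * a for a in x))
    return quad + lam1 * l1 + lam2_w * l2

v, lam1, lam2_w = [4.0, -5.0, 0.5], 1.0, 2.5
u = soft_threshold(v, lam1)            # stage 1: [3.0, -4.0, 0.0]
x_star = single_group_prox(u, lam2_w)  # stage 2: scale = 1 - 2.5 / 5 = 0.5
print(u, x_star)                       # x_star is [1.5, -2.0, 0.0]
```

Evaluating `objective` at `x_star` and at nearby points gives a quick sanity check that the two-stage result minimizes the full objective with both penalty terms.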

It follows from Theorem 1 that we only need to focus on the optimization of
(8) in the following discussion. The difficulty in the optimization of (8)
lies in the large number of groups that may overlap. In practice, many groups
will be zero, thus achieving a sparse solution.¹ However, the zero groups are
not known in advance. The key question we aim to address is how we can
identify as many zero groups as possible to reduce the complexity of the
optimization. We present in the next lemma a sufficient condition for a group
to be zero.



¹ The sparse solution is much more desirable than the non-sparse one in many
applications. For the non-sparse case, one may apply the subgradient based
methods such as those proposed in [20, 24] for solving (8), which deserves
further study.

Lemma 2. Denote the minimizer of $h_{\lambda_2}(\cdot)$ in (8) by
$\mathbf{x}^*$. If the $i$-th group satisfies
$\|\mathbf{u}_{G_i}\| \leq \lambda_2 w_i$, then
$\mathbf{x}^*_{G_i} = \mathbf{0}$, i.e., the $i$-th group is zero.

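Proof. We decompose $h_{\lambda_2}(\mathbf{x})$ into two parts as follows:

$$h_{\lambda_2}(\mathbf{x}) = \left( \frac{1}{2}\|\mathbf{x}_{G_i} - \mathbf{u}_{G_i}\|^2 + \lambda_2 w_i \|\mathbf{x}_{G_i}\| \right) + \left( \frac{1}{2}\|\mathbf{x}_{\bar{G}_i} - \mathbf{u}_{\bar{G}_i}\|^2 + \lambda_2 \sum_{j \neq i} w_j \|\mathbf{x}_{G_j}\| \right), \qquad (13)$$

where $\bar{G}_i = \{1, 2, \ldots, p\} - G_i$ is the complementary set of
$G_i$. We consider the minimization of $h_{\lambda_2}(\mathbf{x})$ in terms
of $\mathbf{x}_{G_i}$ when $\mathbf{x}_{\bar{G}_i} = \mathbf{x}^*_{\bar{G}_i}$
is fixed. It can be verified that if
$\|\mathbf{u}_{G_i}\| \leq \lambda_2 w_i$, then
$\mathbf{x}^*_{G_i} = \mathbf{0}$ minimizes both terms in (13)
simultaneously. Thus we have $\mathbf{x}^*_{G_i} = \mathbf{0}$. □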

Lemma 2 may not identify many true zero groups due to the strong condition
imposed. The lemma below weakens the condition in Lemma 2. Intuitively, for a
group $G_i$, we first identify all existing zero groups that overlap with
$G_i$, and then compute the overlapping index subset $S_i$ of $G_i$ as:

$$S_i = \bigcup_{j \neq i,\ \mathbf{x}^*_{G_j} = \mathbf{0}} (G_j \cap G_i). \qquad (14)$$

We can show that $\mathbf{x}^*_{G_i} = \mathbf{0}$ if
$\|\mathbf{u}_{G_i - S_i}\| \leq \lambda_2 w_i$ is satisfied. Note that this
condition is much weaker than the condition in Lemma 2, which requires that
$\|\mathbf{u}_{G_i}\| \leq \lambda_2 w_i$.

Lemma 3. Denote the minimizer of $h_{\lambda_2}(\cdot)$ by $\mathbf{x}^*$.
Let $S_i$, a subset of $G_i$, be defined in (14). If
$\|\mathbf{u}_{G_i - S_i}\| \leq \lambda_2 w_i$ holds, then
$\mathbf{x}^*_{G_i} = \mathbf{0}$.

The proof of Lemma 3 follows arguments similar to those in Lemma 2 and is
omitted. Lemma 3 naturally leads to an iterative procedure for identifying
the zero groups: for each group $G_i$, if
$\|\mathbf{u}_{G_i}\| \leq \lambda_2 w_i$, then we set
$\mathbf{u}_{G_i} = \mathbf{0}$; we cycle through all groups repeatedly until
$\mathbf{u}$ does not change.

Let $p' = |\{u_i : u_i \neq 0\}|$ be the number of nonzero elements in
$\mathbf{u}$, $g' = |\{\mathbf{u}_{G_i} : \mathbf{u}_{G_i} \neq \mathbf{0}\}|$
be the number of the nonzero groups, and $\mathbf{x}^*$ denote the minimizer
of $h_{\lambda_2}(\cdot)$. It follows from Lemma 3 and Lemma 1 that, if
$u_i = 0$, then $x_i^* = 0$. Therefore, by applying the above iterative
procedure, we can find the minimizer of (8) by solving a reduced problem that
has $p' \leq p$ variables and $g' \leq g$ groups. With some abuse of
notation, we still use (8) to denote the resulting reduced problem. In
addition, from Lemma 1, we only focus on $\mathbf{u} > \mathbf{0}$ in the
following discussion, and the analysis can be easily generalized to the
general $\mathbf{u}$.
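The iterative procedure for identifying zero groups can be sketched as follows. The function name, the list-based group representation with 0-based indices, and the toy data are assumptions made for illustration. The example is arranged so that one group only becomes zero after an overlapping neighbor has been zeroed, which is why repeated passes are needed.

```python
import math

def identify_zero_groups(u, groups, weights, lam2):
    """Cycle through the groups, zeroing u on every group whose norm is at most
    lam2 * w_i, until a full pass leaves u unchanged."""
    u = list(u)
    changed = True
    while changed:
        changed = False
        for group, w in zip(groups, weights):
            norm = math.sqrt(sum(u[j] ** 2 for j in group))
            if 0.0 < norm <= lam2 * w:
                for j in group:
                    u[j] = 0.0
                changed = True
    zero_groups = [i for i, group in enumerate(groups)
                   if all(u[j] == 0.0 for j in group)]
    return u, zero_groups

u = [0.3, 0.9, 0.5, 5.0]
groups = [[1, 2], [0, 1], [2, 3]]
weights = [1.0, 1.0, 1.0]
u_reduced, zero_groups = identify_zero_groups(u, groups, weights, lam2=1.0)
print(u_reduced, zero_groups)  # [0.0, 0.0, 0.0, 5.0] [0, 1]
```

In the first pass only the group `[0, 1]` satisfies the condition. Once its entries are zeroed, the norm on `[1, 2]` drops from about 1.03 to 0.5, so the second pass removes that group as well, leaving a reduced problem with one nonzero variable and one nonzero group.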

3.2 Reformulation as an Equivalent Smooth Convex Optimization Problem

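It follows from the first property of Lemma 1 that we can rewrite (8) as:

$$\pi^{0}_{\lambda_2}(\mathbf{u}) = \arg\min_{\mathbf{x} \in \mathbb{R}^p:\ \mathbf{x} \succeq \mathbf{0}} \left\{ h_{\lambda_2}(\mathbf{x}) \equiv \frac{1}{2}\|\mathbf{x} - \mathbf{u}\|^2 + \lambda_2 \sum_{i=1}^{g} w_i \|\mathbf{x}_{G_i}\| \right\}, \qquad (15)$$

where the minimizer of $h_{\lambda_2}(\cdot)$ is constrained to be
non-negative due to $\mathbf{u} > \mathbf{0}$.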

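Making use of the dual norm of the Euclidean norm $\|\cdot\|$, we can rewrite
$h_{\lambda_2}(\mathbf{x})$ as:

$$h_{\lambda_2}(\mathbf{x}) = \max_{Y \in \Omega} \frac{1}{2}\|\mathbf{x} - \mathbf{u}\|^2 + \sum_{i=1}^{g} \langle \mathbf{x}, Y^i \rangle, \qquad (16)$$

where $\Omega$ is defined as follows:

$$\Omega = \{ Y \in \mathbb{R}^{p \times g} : Y^i_{\bar{G}_i} = \mathbf{0},\ \|Y^i\| \leq \lambda_2 w_i,\ i = 1, 2, \ldots, g \},$$

$\bar{G}_i$ is the complementary set of $G_i$, $Y$ is a sparse matrix
satisfying $Y_{ij} = 0$ if the $i$-th feature does not belong to the $j$-th
group, i.e., $i \notin G_j$, and $Y^i$ denotes the $i$-th column of $Y$. As a
result, we can reformulate (15) as the following min-max problem:

$$\min_{\mathbf{x} \in \mathbb{R}^p:\ \mathbf{x} \succeq \mathbf{0}} \ \max_{Y \in \Omega} \left\{ \psi(\mathbf{x}, Y) = \frac{1}{2}\|\mathbf{x} - \mathbf{u}\|^2 + \langle \mathbf{x}, Y\mathbf{e} \rangle \right\}, \qquad (17)$$

where $\mathbf{e} \in \mathbb{R}^g$ is a vector of ones. It is easy to verify
that $\psi(\mathbf{x}, Y)$ is convex in $\mathbf{x}$ and concave in $Y$, and
the constraint sets are closed convex for both $\mathbf{x}$ and $Y$. Thus,
(17) has a saddle point, and the min-max can be exchanged.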

It is easy to verify that for a given $Y$, the optimal $\mathbf{x}$
minimizing $\psi(\mathbf{x}, Y)$ in (17) is given by

$$\mathbf{x} = \max(\mathbf{u} - Y\mathbf{e}, \mathbf{0}). \qquad (18)$$
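For completeness, the step behind (18) can be written out as a short worked derivation. The shorthand $\mathbf{c} = Y\mathbf{e}$ is introduced here only for this illustration.

```latex
% For a fixed Y, write c = Ye. The inner problem in (17) separates over coordinates:
\min_{\mathbf{x} \succeq \mathbf{0}} \; \tfrac{1}{2}\|\mathbf{x}-\mathbf{u}\|^2
    + \langle \mathbf{x}, \mathbf{c} \rangle
  = \sum_{j=1}^{p} \min_{x_j \ge 0} \Big( \tfrac{1}{2}(x_j - u_j)^2 + c_j x_j \Big).
% Each term is a convex parabola in x_j with unconstrained minimizer u_j - c_j,
% so clipping at zero gives the constrained minimizer:
x_j = \max(u_j - c_j,\, 0), \qquad j = 1, \ldots, p.
```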

Plugging (18) into (17), we obtain the following minimization problem with
regard to $Y$:

$$\min_{Y \in \mathbb{R}^{p \times g}:\ Y \in \Omega} \left\{ \omega(Y) = -\psi(\max(\mathbf{u} - Y\mathbf{e}, \mathbf{0}), Y) \right\}. \qquad (19)$$

Our methodology for minimizing $h_{\lambda_2}(\cdot)$ defined in (8) is to
first solve (19), and then construct the solution to $h_{\lambda_2}(\cdot)$
via (18). We show in Theorem 2 below that the function $\omega(\cdot)$ is
continuously differentiable with Lipschitz continuous gradient. Therefore, we
convert the non-smooth problem (15) to the smooth problem (19), making the
smooth convex optimization tools applicable.



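Theorem 2. The function $\omega(Y)$ is convex and continuously differentiable
with

$$\omega'(Y) = -\max(\mathbf{u} - Y\mathbf{e}, \mathbf{0})\mathbf{e}^T. \qquad (20)$$

In addition, $\omega'(Y)$ is Lipschitz continuous with constant $g^2$, i.e.,

$$\|\omega'(Y_1) - \omega'(Y_2)\|_F \leq g^2 \|Y_1 - Y_2\|_F, \quad \forall\, Y_1, Y_2 \in \mathbb{R}^{p \times g}. \qquad (21)$$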
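The gradient formula (20) can be sanity-checked with a finite difference, as in the sketch below. The storage scheme (each column $Y^i$ kept only on the indices of its group), the function names, and the test values are assumptions made for illustration.

```python
def row_sums(Y, groups, p):
    """Compute Ye, where column i of Y is stored only on the indices in groups[i]."""
    total = [0.0] * p
    for column, group in zip(Y, groups):
        for value, j in zip(column, group):
            total[j] += value
    return total

def primal_from_dual(u, Y, groups):
    """x = max(u - Ye, 0), as in (18)."""
    return [max(a - s, 0.0) for a, s in zip(u, row_sums(Y, groups, len(u)))]

def omega(u, Y, groups):
    """omega(Y) = -psi(max(u - Ye, 0), Y), as in (19)."""
    ye = row_sums(Y, groups, len(u))
    x = primal_from_dual(u, Y, groups)
    psi = (0.5 * sum((a - b) ** 2 for a, b in zip(x, u))
           + sum(a * s for a, s in zip(x, ye)))
    return -psi

def omega_grad(u, Y, groups):
    """Entries of -max(u - Ye, 0) e^T on the stored (in-group) positions."""
    x = primal_from_dual(u, Y, groups)
    return [[-x[j] for j in group] for group in groups]

u = [2.0, 1.0, 3.0]
groups = [[0, 1], [1, 2]]
Y = [[0.5, 0.2], [0.3, 0.1]]
grad = omega_grad(u, Y, groups)

# Central finite difference with respect to the stored entry Y[0][0].
eps = 1e-6
Y_plus = [[0.5 + eps, 0.2], [0.3, 0.1]]
Y_minus = [[0.5 - eps, 0.2], [0.3, 0.1]]
fd = (omega(u, Y_plus, groups) - omega(u, Y_minus, groups)) / (2 * eps)
print(grad[0][0], fd)  # both approximately -1.5
```

Only the in-group entries of the gradient are formed, since entries of $Y$ outside a group are fixed at zero by the definition of $\Omega$.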

To prove Theorem 2, we first present two technical lemmas. The first lemma is
related to the optimal value function [3, 5], and it was used in a recent
study [25] on infinite kernel learning.

Lemma 4. [3, Theorem 4.1] Let $X$ be a metric space and $U$ be a normed
space. Suppose that for all $\mathbf{x} \in X$, the function
$\psi(\mathbf{x}, \cdot)$ is differentiable and that $\psi(\mathbf{x}, Y)$ and
$D_Y \psi(\mathbf{x}, Y)$ (the partial derivative of $\psi(\mathbf{x}, Y)$
with respect to $Y$) are continuous on $X \times U$. Let $\Phi$ be a compact
subset of $X$. Define the optimal value function as
$\varphi(Y) = \inf_{\mathbf{x} \in \Phi} \psi(\mathbf{x}, Y)$. The optimal
value function $\varphi(Y)$ is directionally differentiable. In addition, if
$\forall\, Y \in U$, $\psi(\cdot, Y)$ has a unique minimizer $\mathbf{x}(Y)$
over $\Phi$, then $\varphi(Y)$ is differentiable at $Y$ and the gradient of
$\varphi(Y)$ is given by $\varphi'(Y) = D_Y \psi(\mathbf{x}(Y), Y)$.

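The second lemma shows that the operator
$\mathbf{y} = \max(\mathbf{x}, \mathbf{0})$ is non-expansive.

Lemma 5. $\forall\, \mathbf{x}, \mathbf{y} \in \mathbb{R}^p$, we have
$\|\max(\mathbf{x}, \mathbf{0}) - \max(\mathbf{y}, \mathbf{0})\| \leq \|\mathbf{x} - \mathbf{y}\|$.

Proof. The result follows since $|\max(x, 0) - \max(y, 0)| \leq |x - y|$,
$\forall\, x, y \in \mathbb{R}$. □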

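Proof of Theorem 2. To prove the differentiability of $\omega(Y)$, we apply
Lemma 4 with $X = \mathbb{R}^p$, $U = \mathbb{R}^{p \times g}$ and
$\Phi = \{\mathbf{x} \in X : \mathbf{u} + \lambda_2 \sum_{i=1}^{g} w_i \mathbf{e} \succeq \mathbf{x} \succeq \mathbf{0}\}$.
It is easy to verify that 1) $\psi(\mathbf{x}, \cdot)$ is differentiable; 2)
$\psi(\mathbf{x}, Y)$ and $D_Y \psi(\mathbf{x}, Y) = \mathbf{x}\mathbf{e}^T$
are continuous on $X \times U$; 3) $\Phi$ is a compact subset of $X$; and 4)
$\forall\, Y \in U$, $\psi(\mathbf{x}, Y)$ has a unique minimizer
$\mathbf{x}(Y) = \max(\mathbf{u} - Y\mathbf{e}, \mathbf{0})$ over $\Phi$. Note
that the last result follows from $\mathbf{u} > \mathbf{0}$ and
$\mathbf{u} - Y\mathbf{e} \preceq \mathbf{u} + \lambda_2 \sum_{i=1}^{g} w_i \mathbf{e}$,
where the latter inequality utilizes $\|Y^i\| \leq \lambda_2 w_i$; this
indicates that
$\mathbf{x}(Y) = \max(\mathbf{u} - Y\mathbf{e}, \mathbf{0}) = \arg\min_{\mathbf{x} \succeq \mathbf{0}} \psi(\mathbf{x}, Y) = \arg\min_{\mathbf{x} \in \Phi} \psi(\mathbf{x}, Y)$.
It follows from Lemma 4 that

$$\varphi(Y) = \inf_{\mathbf{x} \in \Phi} \psi(\mathbf{x}, Y) = \psi(\max(\mathbf{u} - Y\mathbf{e}, \mathbf{0}), Y)$$

is differentiable with
$\varphi'(Y) = \max(\mathbf{u} - Y\mathbf{e}, \mathbf{0})\mathbf{e}^T$.

In (17), $\psi(\mathbf{x}, Y)$ is convex in $\mathbf{x}$ and concave in $Y$,
and the constraint sets are closed convex for both $\mathbf{x}$ and $Y$, thus
the existence of the saddle point is guaranteed by the well-known von Neumann
Lemma [15, Chapter 5.1]. As a result,

$$\varphi(Y) = \inf_{\mathbf{x} \succeq \mathbf{0}} \psi(\mathbf{x}, Y) = \psi(\max(\mathbf{u} - Y\mathbf{e}, \mathbf{0}), Y)$$

is concave, and $\omega(Y) = -\varphi(Y)$ is convex. For any $Y_1, Y_2$, we
have

$$\begin{aligned}
\|\omega'(Y_1) - \omega'(Y_2)\|_F &= \|\max(\mathbf{u} - Y_1\mathbf{e}, \mathbf{0})\mathbf{e}^T - \max(\mathbf{u} - Y_2\mathbf{e}, \mathbf{0})\mathbf{e}^T\|_F \\
&\leq \|\mathbf{e}\| \times \|\max(\mathbf{u} - Y_1\mathbf{e}, \mathbf{0}) - \max(\mathbf{u} - Y_2\mathbf{e}, \mathbf{0})\| \\
&\leq \|\mathbf{e}\| \times \|(Y_1 - Y_2)\mathbf{e}\| \\
&\leq g^2 \|Y_1 - Y_2\|_F,
\end{aligned} \qquad (22)$$

where the second inequality follows from Lemma 5. This proves (21). □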
From Theorem 2, the problem in (19) is a constrained smooth convex
optimization problem, and existing solvers for constrained smooth convex
optimization can be applied. In this paper, we employ the accelerated
gradient descent to solve (19), due to its fast convergence property. Note
that the Euclidean projection onto the set $\Omega$ can be computed in closed
form. We would like to emphasize here that the problem (19) may have a much
smaller size than (6).
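To make the dual approach concrete, the sketch below solves (19) with plain projected gradient steps, which is a simpler stand-in for the accelerated method used in the paper. The projection rescales each stored column $Y^i$ into the ball of radius $\lambda_2 w_i$; entries outside a group are never stored, so the zero-pattern constraint holds by construction. The function names, the fixed step size $1/g^2$ (taken from the Lipschitz constant in (21)), and the iteration count are assumptions made for illustration.

```python
import math

def project_onto_omega(Y, radii):
    """Scale each stored column Y^i back into the ball ||Y^i|| <= lam2 * w_i."""
    projected = []
    for column, r in zip(Y, radii):
        norm = math.sqrt(sum(a * a for a in column))
        scale = 1.0 if norm <= r else r / norm
        projected.append([scale * a for a in column])
    return projected

def solve_dual(u, groups, weights, lam2, iters=500):
    """Projected gradient on omega(Y); returns (x, Y) with x = max(u - Ye, 0)."""
    g = len(groups)
    step = 1.0 / (g * g)   # 1 / (Lipschitz constant g^2 from (21))
    radii = [lam2 * w for w in weights]
    Y = [[0.0] * len(group) for group in groups]

    def primal(Y):
        ye = [0.0] * len(u)
        for column, group in zip(Y, groups):
            for value, j in zip(column, group):
                ye[j] += value
        return [max(a - s, 0.0) for a, s in zip(u, ye)]

    for _ in range(iters):
        x = primal(Y)
        # The gradient entry for (j, i) with j in G_i is -x_j, so descent adds x_j.
        Y = [[value + step * x[j] for value, j in zip(column, group)]
             for column, group in zip(Y, groups)]
        Y = project_onto_omega(Y, radii)
    return primal(Y), Y

# Single group: the result should match group soft-thresholding, (1 - 2.5/5) * u.
x_single, _ = solve_dual([3.0, 4.0], [[0, 1]], [1.0], lam2=2.5)
print(x_single)  # close to [1.5, 2.0]

# Two overlapping groups sharing feature 1.
x_overlap, Y_overlap = solve_dual([2.0, 1.0, 3.0], [[0, 1], [1, 2]], [1.0, 1.0], lam2=0.5)
print(x_overlap)
```

The single-group case serves as a check against a known closed form; for overlapping groups no such formula is available, which is the situation the dual reformulation is designed for.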

3.3 Computing the Duality Gap

We show how to estimate the duality gap of the min-max problem (17), which
can be used to check the quality of the solution and determine the
convergence of the algorithm.

For any given approximate solution $\tilde{Y} \in \Omega$ for $\omega(Y)$, we
can construct the approximate solution
$\tilde{\mathbf{x}} = \max(\mathbf{u} - \tilde{Y}\mathbf{e}, \mathbf{0})$ for
$h_{\lambda_2}(\mathbf{x})$. The duality gap for the min-max problem (17) at
the point $(\tilde{\mathbf{x}}, \tilde{Y})$ can be computed as:

$$\mathrm{gap}(\tilde{Y}) = \max_{Y \in \Omega} \psi(\tilde{\mathbf{x}}, Y) - \min_{\mathbf{x} \in \mathbb{R}^p:\ \mathbf{x} \succeq \mathbf{0}} \psi(\mathbf{x}, \tilde{Y}). \qquad (23)$$

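The main result of this subsection is summarized in the following theorem:

Theorem 3. Let $\mathrm{gap}(\tilde{Y})$ be the duality gap defined in (23).
Then, the following holds:

$$\mathrm{gap}(\tilde{Y}) = \sum_{i=1}^{g} \left( \lambda_2 w_i \|\tilde{\mathbf{x}}_{G_i}\| - \langle \tilde{\mathbf{x}}_{G_i}, \tilde{Y}^i_{G_i} \rangle \right). \qquad (24)$$

In addition, we have

$$\omega(\tilde{Y}) - \omega(Y^*) \leq \mathrm{gap}(\tilde{Y}), \qquad (25)$$
$$h_{\lambda_2}(\tilde{\mathbf{x}}) - h_{\lambda_2}(\mathbf{x}^*) \leq \mathrm{gap}(\tilde{Y}). \qquad (26)$$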

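Proof. Denote $(\mathbf{x}^*, Y^*)$ as the optimal solution to the min-max
problem (17). From (18)-(19), we have

$$-\omega(\tilde{Y}) = \psi(\tilde{\mathbf{x}}, \tilde{Y}) = \min_{\mathbf{x} \in \mathbb{R}^p:\ \mathbf{x} \succeq \mathbf{0}} \psi(\mathbf{x}, \tilde{Y}) \leq \psi(\mathbf{x}^*, \tilde{Y}), \qquad (27)$$

$$\psi(\mathbf{x}^*, \tilde{Y}) \leq \max_{Y \in \Omega} \psi(\mathbf{x}^*, Y) = \psi(\mathbf{x}^*, Y^*) = -\omega(Y^*), \qquad (28)$$

$$h_{\lambda_2}(\mathbf{x}^*) = \psi(\mathbf{x}^*, Y^*) = \min_{\mathbf{x} \in \mathbb{R}^p:\ \mathbf{x} \succeq \mathbf{0}} \psi(\mathbf{x}, Y^*) \leq \psi(\tilde{\mathbf{x}}, Y^*), \qquad (29)$$

$$\psi(\tilde{\mathbf{x}}, Y^*) \leq \max_{Y \in \Omega} \psi(\tilde{\mathbf{x}}, Y) = h_{\lambda_2}(\tilde{\mathbf{x}}). \qquad (30)$$

Incorporating (23) and (27)-(30), we prove (24)-(26). □

In our experiments, we terminate the algorithm when the estimated duality gap
is less than $10^{-10}$.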
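As a hedged illustration of how the gap can serve as a stopping criterion, the sketch below evaluates the group-wise expression for the gap, namely the sum over groups of $\lambda_2 w_i \|\tilde{\mathbf{x}}_{G_i}\| - \langle \tilde{\mathbf{x}}_{G_i}, \tilde{Y}^i_{G_i} \rangle$ with $\tilde{\mathbf{x}} = \max(\mathbf{u} - \tilde{Y}\mathbf{e}, \mathbf{0})$. The function name, the column-wise storage of $Y$ on group indices, and the example values are assumptions made for illustration.

```python
import math

def duality_gap(u, Y, groups, weights, lam2):
    """Sum over groups of lam2 * w_i * ||x_{G_i}|| - <x_{G_i}, Y^i>, x = max(u - Ye, 0)."""
    ye = [0.0] * len(u)
    for column, group in zip(Y, groups):
        for value, j in zip(column, group):
            ye[j] += value
    x = [max(a - s, 0.0) for a, s in zip(u, ye)]
    gap = 0.0
    for column, group, w in zip(Y, groups, weights):
        norm = math.sqrt(sum(x[j] ** 2 for j in group))
        inner = sum(x[j] * value for value, j in zip(column, group))
        gap += lam2 * w * norm - inner
    return gap

u, groups, weights, lam2 = [3.0, 4.0], [[0, 1]], [1.0], 2.5
print(duality_gap(u, [[0.0, 0.0]], groups, weights, lam2))  # 12.5 at the starting point
print(duality_gap(u, [[1.5, 2.0]], groups, weights, lam2))  # 0.0 at the dual optimum
```

A solver loop would evaluate this quantity after each dual update and stop once it falls below the chosen tolerance.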



4 Experiments 

We have conducted experiments to evaluate the efficiency of the proposed
algorithm using the breast cancer gene expression data set [23], which
consists of 8,141 genes in 295 breast cancer tumors (78 metastatic and 217
non-metastatic). For the sake of analyzing microarrays in terms of
biologically meaningful gene sets, different approaches have been used to
organize the genes into (overlapping) gene sets. In our experiments, we
follow [8] and employ the following two approaches for generating the
overlapping gene sets (groups): pathways [21] and edges [4]. For pathways,
the canonical pathways from the Molecular Signatures Database (MSigDB) [21]
are used. It contains 639 groups of genes, of which 637 groups involve the
genes in our study. The statistics of the 637 gene groups are summarized as
follows: the average number of genes in each group is 23.7, the largest gene
group has 213 genes, and 3,510 genes appear in these 637 groups with an
average appearance frequency of about 4. For edges, the network built in [4]
is used, and we follow [8] to extract 42,594 edges from the network, leading
to 42,594 overlapping gene sets of size 2. All 8,141 genes appear in the
42,594 groups with an average appearance frequency of about 10.

Figure 1: Comparison of SLasso [9] and our proposed FoGLasso algorithm in
terms of computational time (in seconds) when different numbers of genes
(variables) are involved. The computation time is reported in the logarithmic
scale.

Efficiency of the Proposed FoGLasso We compare our proposed FoGLasso with the
SLasso algorithm developed by Jenatton et al. [9] for solving (1) with the
least squares loss
$l(\mathbf{x}) = \frac{1}{2}\|A\mathbf{x} - \mathbf{b}\|^2$. The experimental
settings are as follows: we set $w_i = \sqrt{|G_i|}$, and
$\lambda_1 = \lambda_2 = \rho \times \lambda_1^{\max}$, where $|G_i|$ denotes
the size of the $i$-th group $G_i$,
$\lambda_1^{\max} = \|A^T \mathbf{b}\|_\infty$ (the zero point is a solution
to (1) if $\lambda_1 \geq \lambda_1^{\max}$), and $\rho$ is chosen from the
set $\{5 \times 10^{-1}, 2 \times 10^{-1}, 1 \times 10^{-1}, 5 \times 10^{-2}, 2 \times 10^{-2}, 1 \times 10^{-2}, 5 \times 10^{-3}, 2 \times 10^{-3}, 1 \times 10^{-3}\}$.
For a given $\rho$, we first run SLasso, and then run our proposed FoGLasso
until it achieves an objective function value smaller than or equal to that
of SLasso. For both SLasso and FoGLasso, we apply the "warm" start technique,
i.e., using the solution corresponding to the larger regularization parameter
as the "warm" start for the smaller one. We vary the number of genes
involved, and report the total computational time (seconds) including all
nine regularization parameters in Figure 1 and Table 1. We can observe that:
1) our proposed FoGLasso is much more efficient than SLasso; 2) the advantage
of FoGLasso over SLasso in efficiency grows with the increasing number of
genes (variables); for example, with the grouping by pathways, FoGLasso is
about 25 and 70 times faster than SLasso for 1000 and 2000 genes (variables),
respectively; and 3) the efficiency on edges is inferior to that on pathways,
due to the larger number of overlapping groups. These results verify the
efficiency of the proposed FoGLasso algorithm based on the efficient
procedure for computing the proximal operator presented in Section 3.

Table 1: Scalability study of the proposed FoGLasso algorithm under different
numbers (p) of genes involved. The reported results are the total
computational time (seconds) including all nine regularization parameter
values.

p          3000   4000   5000   6000   7000   8141
pathways   37.6   48.3   62.5   68.7   86.2   99.7
edges      58.8   84.8  102.7  140.8  173.3  247.8

Figure 2: Performance of the computation of the proximal operator in
FoGLasso. The left plot shows the objective function value during the
FoGLasso iteration. The middle plot shows the percentage of the identified
zero groups by applying Lemma 3. The right plot shows the number of inner
iterations for achieving the duality gap less than $10^{-10}$ when one solves
the proximal operator via the dual reformulation (see Section 3.2).
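The regularization settings used in the efficiency comparison (group weights, the largest useful regularization value, the grid of nine ratios, and warm starts) can be sketched as follows. The helper names and the `solver(lam, x_init)` callback are hypothetical and stand in for whatever optimizer is used; only the formulas for the weights, the maximum value, and the grid come from the text.

```python
import math

def lambda_max(A, b):
    """||A^T b||_inf for A given as a list of rows."""
    n_cols = len(A[0])
    return max(abs(sum(A[i][j] * b[i] for i in range(len(A)))) for j in range(n_cols))

def group_weights(groups):
    """w_i = sqrt(|G_i|)."""
    return [math.sqrt(len(group)) for group in groups]

# The nine ratios listed in the experimental settings.
RHOS = [5e-1, 2e-1, 1e-1, 5e-2, 2e-2, 1e-2, 5e-3, 2e-3, 1e-3]

def regularization_path(A, b, rhos=RHOS):
    """lambda_1 = lambda_2 = rho * lambda_max, ordered from largest to smallest."""
    lam_max = lambda_max(A, b)
    return [rho * lam_max for rho in sorted(rhos, reverse=True)]

def solve_path(lams, solver, x0):
    """Warm start: each solve is initialized at the solution for the previous,
    larger regularization value. `solver` is a hypothetical callback."""
    solutions, x = [], x0
    for lam in lams:
        x = solver(lam, x)
        solutions.append(x)
    return solutions

A = [[1.0, 2.0], [3.0, 4.0]]
b = [1.0, 1.0]
path = regularization_path(A, b)
print(lambda_max(A, b), path[0], group_weights([[0], [0, 1, 2, 3]]))
```

Ordering the path from the largest value down means the early problems have very sparse solutions and are cheap to solve, and each later problem starts near its own solution.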

Computation of the Proximal Operator In this experiment, we run FoGLasso on
the breast cancer data set using all 8,141 genes. We terminate FoGLasso if
the change of the objective function value is less than $10^{-5}$. We use the
42,594 edges to generate the overlapping groups. We obtain similar results
for the 637 groups based on pathways. We set $\rho = 0.01$. The results are
shown in Figure 2. The left plot shows that the objective function value
decreases rapidly in the proposed FoGLasso. In the middle plot, we report the
percentage of the identified zero groups by applying Lemma 3. Our
experimental result shows that: 1) after 16 iterations, 50% of the zero
groups are correctly identified; and 2) after 50 iterations, 80% of the zero
groups are identified. Therefore, with Lemma 3, we can significantly reduce
the problem size of the subsequent dual reformulation (see Section 3.2). In
the right plot of Figure 2, we present the number of inner iterations for
solving the proximal operator via the dual reformulation. We attribute the
decreasing number of inner iterations to two factors: 1) the size of the
reduced problem decreases as more zero groups are identified (see the middle
plot); and 2) in solving the dual reformulation, we can use the $Y$ computed
in the previous iteration as the "warm" start for computing the proximal
operator in the next iteration.

Classification Performance We compare the classification performance of the
overlapping group Lasso with Lasso. We use 60% of the samples for training
and the remaining 40% for testing. To deal with the imbalance of the positive
and negative samples, we make use of the balanced error rate [BJ, which is
defined as the average error of the two classes. We report the results
averaged over 10 runs in Figure 3. Our results show that: 1) with the
overlapping pathways, overlapping group Lasso and Lasso achieve comparable
classification performance; 2) with the overlapping edges, overlapping group
Lasso outperforms Lasso; and 3) the performance based on edges is better than
that based on the pathways in our experiment.

Figure 3: Comparison of overlapping group Lasso and Lasso in terms of the
balanced error rate. The left plot shows the classification performance with
overlapping pathways; and the right plot shows the result with the
overlapping edges.
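The balanced error rate used in the classification comparison, defined in the text as the average error of the two classes, can be computed as in the sketch below. The function name, the label encoding (+1 and -1), and the toy predictions are assumptions made for illustration.

```python
def balanced_error_rate(y_true, y_pred, positive=1):
    """Average of the error rate on the positive class and on the negative class."""
    pos = [(t, p) for t, p in zip(y_true, y_pred) if t == positive]
    neg = [(t, p) for t, p in zip(y_true, y_pred) if t != positive]
    pos_err = sum(1 for t, p in pos if p != t) / len(pos)
    neg_err = sum(1 for t, p in neg if p != t) / len(neg)
    return 0.5 * (pos_err + neg_err)

y_true = [1, 1, 1, 1, -1, -1]
y_pred = [1, 1, 1, -1, -1, 1]
print(balanced_error_rate(y_true, y_pred))  # 0.5 * (1/4 + 1/2) = 0.375
```

On the same toy data the plain error rate is 2/6, so the balanced version gives more weight to mistakes on the smaller class.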



5 Conclusion 

In this paper, we consider the efficient optimization of the overlapping
group Lasso penalized problem based on the accelerated gradient descent
method. We reveal several key properties of the proximal operator associated
with the overlapping group Lasso, and compute the proximal operator via
solving the smooth and convex dual problem. Numerical experiments on the
breast cancer data set demonstrate the efficiency of the proposed algorithm.
Our experimental results also show the benefit of the overlapping group Lasso
in comparison with Lasso. In the future, we plan to apply the proposed
algorithm to other real-world applications involving overlapping groups.